There is mounting empirical evidence of emergent capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work explores this question through the lens of learning a $k$-sparse parity of $n$ bits, a canonical problem that poses theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with $n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after $n^{O(k)}$ iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity prior. We elucidate the mechanism of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
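The $(n, k)$-sparse parity problem above can be sketched concretely. This is a minimal illustration, not the paper's experimental setup: inputs are uniform $\pm 1$ strings of length $n$, and the label is the parity (product) of a hidden subset $S$ of $k$ coordinates; the sizes and the sampled support are illustrative choices.

```python
import numpy as np

def make_sparse_parity(num_samples, n, support, rng):
    """Sample uniform ±1 inputs and their k-sparse parity labels."""
    X = rng.choice([-1.0, 1.0], size=(num_samples, n))
    y = np.prod(X[:, support], axis=1)  # parity = product of ±1 bits
    return X, y

rng = np.random.default_rng(0)
n, k = 20, 3
support = rng.choice(n, size=k, replace=False)  # hidden feature set S
X, y = make_sparse_parity(1000, n, support, rng)
```

Without knowledge of `support`, any single coordinate of `X` is uncorrelated with `y`, which is what makes the problem a canonical computational barrier for gradient-based learners.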
In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: measuring the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, in-distribution and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with a strong correlation between pointwise and average performance. On the other hand, there are points with weak or even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on the CIFAR-10 test set. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy on the line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021).
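A point's profile can be sketched as a simple correlation across a model collection. The synthetic accuracies and correctness values below are illustrative stand-ins for a real collection of trained models:

```python
import numpy as np

def profile_correlation(avg_accuracy, pointwise_correct):
    """Pearson correlation between models' average test accuracy
    and their correctness on one fixed input point."""
    return np.corrcoef(avg_accuracy, pointwise_correct)[0, 1]

avg_acc = np.array([0.60, 0.70, 0.80, 0.90])  # per-model average accuracy
# A "compatible" point: better models tend to classify it correctly.
compatible = np.array([0.0, 1.0, 1.0, 1.0])
# A negatively-correlated point: improving models hurts this input.
negated = np.array([1.0, 1.0, 0.0, 0.0])

r_pos = profile_correlation(avg_acc, compatible)
r_neg = profile_correlation(avg_acc, negated)
```

Points whose profile resembles `negated` are exactly the ones collected into a set like CIFAR-10-NEG.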
We propose a notation for tensors with named axes, which relieves the author, reader, and future implementers of machine learning models from the burden of keeping track of the order of axes and the purpose of each. The notation makes it easy to lift operations on low-order tensors to higher order ones, for example, from images to minibatches of images, or from an attention mechanism to multiple attention heads. After a brief overview and formal definition of the notation, we illustrate it through several examples from modern machine learning, from building blocks like attention and convolution to full models like Transformers and LeNet. We then discuss differential calculus in our notation and compare with some alternative notations. Our proposals build on ideas from many previous papers and software libraries. We hope that our notation will encourage more authors to use named tensors, resulting in clearer papers and more precise implementations.
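The core idea of addressing axes by name rather than position can be sketched in a few lines. The class and method names here are illustrative, not the notation's official API:

```python
import numpy as np

class NamedTensor:
    """A tensor whose axes are addressed by name, not position."""
    def __init__(self, data, names):
        assert data.ndim == len(names)
        self.data, self.names = data, tuple(names)

    def contract(self, other, name):
        """Sum-product over the shared axis `name`, keeping the rest."""
        i = self.names.index(name)
        j = other.names.index(name)
        data = np.tensordot(self.data, other.data, axes=(i, j))
        names = [n for n in self.names if n != name] + \
                [n for n in other.names if n != name]
        return NamedTensor(data, names)

# e.g. a batch of 1-D "images" times a per-channel weight vector;
# the operation needs no knowledge of where "channel" sits.
x = NamedTensor(np.ones((4, 3, 8)), ["batch", "channel", "width"])
w = NamedTensor(np.ones(3), ["channel"])
y = x.contract(w, "channel")
```

Because `contract` looks axes up by name, the same code lifts unchanged from a single image to a minibatch, which is the lifting property the notation is designed around.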
We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance. * Work performed in part while Preetum Nakkiran was interning at OpenAI, with Ilya Sutskever. We especially thank Mikhail Belkin and Christopher Olah for helpful discussions throughout this work.
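The model-size axis of the phenomenon can be sketched with random-feature regression fit by minimum-norm least squares: around $p = n$ features the model begins to interpolate (zero train error), the threshold that effective model complexity generalizes to other resources such as training time. All sizes and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                  # training samples
x = rng.standard_normal((n, 5))
y = rng.standard_normal(n)

def train_error(p):
    """Train MSE of a p-feature random-feature model (min-norm fit)."""
    W = rng.standard_normal((5, p))
    feats = np.tanh(x @ W)              # random nonlinear features
    coef, *_ = np.linalg.lstsq(feats, y, rcond=None)
    return float(np.mean((feats @ coef - y) ** 2))

errs = {p: train_error(p) for p in (5, 10, 20, 40)}  # under- to over-parameterized
```

Past the interpolation threshold (`p >= n`) the train error is numerically zero; the double-descent claim concerns how *test* error behaves on either side of that threshold.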
Inertial and Doppler velocity log sensors are commonly used to provide the navigation solution for autonomous underwater vehicles (AUVs). To this end, a nonlinear filter is adopted for the fusion task. The filter's process noise covariance matrix is critical for filter accuracy and robustness. While this matrix varies over time during the AUV mission, the filter assumes it is constant. Several models and learning approaches in the literature suggest tuning the process noise covariance during operation. In this work, we propose ProNet, a hybrid, adaptive process noise estimation approach for a velocity-aided navigation filter. ProNet requires only the inertial sensor readings to regress the process noise covariance. Once learned, it is fed into the model-based navigation filter, resulting in a hybrid filter. Simulation results show the benefits of our approach compared to other model-based and learning-based adaptive approaches.
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications. However, standard methods for evaluating these metrics have yet to be established. We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics, allowing sensible comparison of their behavior. We demonstrate the effectiveness of our evaluation measures in capturing fundamental characteristics by evaluating them on a collection of classical and state-of-the-art metrics. Our measures reveal that recently-developed metrics are becoming better at identifying semantic distributional mismatch, while classical metrics are more sensitive to perturbations at the surface text level.
The field of emergent communication aims to understand the characteristics of communication as it emerges from artificial agents solving tasks that require information exchange. Communication with discrete messages is considered a desired characteristic, for both scientific and applied reasons. However, training a multi-agent system with discrete communication is not straightforward, requiring either reinforcement learning algorithms or relaxing the discreteness requirement via a continuous approximation such as the Gumbel-softmax. Both these solutions result in poor performance compared to fully continuous communication. In this work, we propose an alternative approach to achieve discrete communication -- quantization of communicated messages. Using message quantization allows us to train the model end-to-end, achieving superior performance in multiple setups. Moreover, quantization is a natural framework that runs the gamut from continuous to discrete communication. Thus, it sets the ground for a broader view of multi-agent communication in the deep learning era.
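The forward pass of message quantization can be sketched as rounding each continuous channel value to the nearest of $L$ evenly spaced levels. In training, a straight-through estimator would pass gradients through the rounding; here only the forward discretization is shown, and the level counts are illustrative choices:

```python
import numpy as np

def quantize(message, num_levels):
    """Snap each entry of `message` (in [0, 1]) to one of `num_levels`
    evenly spaced levels; gradients would bypass the rounding."""
    levels = num_levels - 1
    return np.round(np.clip(message, 0.0, 1.0) * levels) / levels

m = np.array([0.02, 0.49, 0.51, 0.98])
q = quantize(m, num_levels=2)   # binary messages: values in {0.0, 1.0}
q8 = quantize(m, num_levels=8)  # finer grid, closer to continuous
```

Varying `num_levels` is what lets quantization "run the gamut" from effectively continuous (many levels) to fully discrete (two levels) communication.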
We suggest the first system that performs real-time semantic segmentation via deep learning on a weak micro-computer, such as the Raspberry Pi Zero v2 (price: \$15), attached to a toy drone. In particular, since the Raspberry Pi weighs less than $16$ grams and its size is half of a credit card, we could easily attach it to the common commercial DJI Tello toy drone (<\$100, <90 grams, $98 \times 92.5 \times 41$ mm). The result is an autonomous drone (no laptop or human in the loop) that can detect and classify objects in real time from an on-board monocular RGB camera (no GPS or LIDAR sensors). The companion videos demonstrate how this Tello drone scans the lab for people (e.g., for fire-fighters or security forces) and for an empty parking slot outside the lab. Existing deep learning solutions are either too slow for real-time computation on such IoT devices or provide results of impractical quality. Our main challenge was to design a system that takes the best of all worlds among the numerous combinations of networks, deep learning platforms/frameworks, compression techniques, and compression ratios. To this end, we provide an efficient search algorithm that aims to find the optimal combination, yielding the best trade-off between the network's running time and its accuracy/performance.
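The trade-off the search targets can be sketched as keeping only Pareto-optimal configurations in the runtime-accuracy plane. The candidate list below is illustrative; the actual search algorithm over networks, frameworks, and compression ratios is more involved:

```python
def pareto_frontier(candidates):
    """Keep configurations not dominated by any other in both
    runtime (lower is better) and accuracy (higher is better)."""
    frontier = []
    for name, runtime, acc in candidates:
        dominated = any(r <= runtime and a >= acc and (r, a) != (runtime, acc)
                        for _, r, a in candidates)
        if not dominated:
            frontier.append((name, runtime, acc))
    return frontier

# Hypothetical (config, runtime in ms, accuracy) measurements:
configs = [
    ("net-A/8-bit", 30.0, 0.71),
    ("net-A/4-bit", 18.0, 0.66),
    ("net-B/8-bit", 45.0, 0.70),   # slower and less accurate than net-A/8-bit
    ("net-B/4-bit", 25.0, 0.64),   # slower and less accurate than net-A/4-bit
]
best = pareto_frontier(configs)
```

Any configuration off this frontier can be discarded outright, which is what makes the combined search space tractable.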
We study the problem of high-dimensional sparse linear regression in a distributed setting under computational and communication constraints. Specifically, we consider a star-topology network whereby several machines are connected to a fusion center, with which they can exchange relatively short messages. Each machine holds noisy samples from a linear regression model with the same unknown sparse $d$-dimensional vector $\theta$. The goal of the fusion center is to estimate the vector $\theta$ and its support using little computation and limited communication at each machine. In this work, we consider distributed algorithms based on Orthogonal Matching Pursuit (OMP) and theoretically study their ability to exactly recover the support of $\theta$. We prove that under certain conditions, even in regimes where a single machine is unable to detect the support of $\theta$, distributed-OMP methods correctly recover it with total communication sublinear in $d$. In addition, we present simulations that illustrate the performance of distributed OMP-based algorithms, showing that their performance is similar to that of more sophisticated and computationally intensive methods, and in some cases even outperforms them.
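The centralized building block of these methods, Orthogonal Matching Pursuit, can be sketched in a few lines: greedily select the column most correlated with the residual, then re-fit by least squares on the selected support. Problem sizes are illustrative and the setup is noiseless for clarity:

```python
import numpy as np

def omp(A, y, k):
    """Greedy sketch of OMP: recover a k-sparse x with y = A @ x."""
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))  # most correlated column
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef          # re-fit, update residual
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 40))          # 30 noiseless measurements, d = 40
x_true = np.zeros(40)
x_true[[3, 17, 29]] = [1.0, -2.0, 1.5]     # k = 3 sparse signal
x_hat = omp(A, A @ x_true, k=3)
```

The distributed variants studied in the paper coordinate such greedy selections across machines so that only short messages (e.g., candidate indices) reach the fusion center.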
Given a polygon $W$, a depth sensor placed at point $p=(x,y)$ inside $W$ and oriented in direction $\theta$ measures the distance $d=h(x,y,\theta)$ between $p$ and the closest point on the boundary of $W$ along a ray emanating from $p$ in direction $\theta$. We study the following problem: Given a polygon $W$, possibly with holes, with $n$ vertices, preprocess it such that given a query real value $d \geq 0$, one can efficiently compute the preimage $h^{-1}(d)$, i.e., determine all the possible poses (positions and orientations) of a depth sensor placed in $W$ that would yield the reading $d$. We employ a decomposition of $W \times S^1$, an extension of the celebrated trapezoidal decomposition, which we call the rotational trapezoidal decomposition, and present an efficient data structure that computes the preimage in an output-sensitive fashion relative to this decomposition: if $k$ cells of the decomposition contribute to the final result, we report them in $O(k+1)$ time, after $O(n^2 \log n)$ preprocessing time and using $O(n^2)$ storage space. We also analyze the shape of the projection of the preimage onto the polygon $W$; this projection describes the portion of $W$ where the sensor could have been placed. Furthermore, we obtain similar results for the more useful case (narrowing down the set of possible poses), where the sensor takes two depth measurements from the same point $p$, one in direction $\theta$ and the other in direction $\theta+\pi$. While localization problems in robotics are typically carried out by exploring the full visibility polygon of a sensor placed at a fixed point of the environment, the approach we propose here opens the door to sufficing with only a few depth measurements, which is advantageous as it allows the use of inexpensive sensors and may also lead to savings in storage and communication costs.
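The measurement model $h(x, y, \theta)$ itself can be sketched by casting a ray from $p$ and intersecting it with each boundary segment. This brute-force evaluation (no preprocessing structure) is only an illustration of the function being inverted; the square polygon is an example:

```python
import math

def depth(polygon, p, theta):
    """Distance from p to the boundary of `polygon` (a CCW vertex list)
    along direction theta; brute-force over all boundary edges."""
    px, py = p
    dx, dy = math.cos(theta), math.sin(theta)
    best = math.inf
    m = len(polygon)
    for i in range(m):
        (ax, ay), (bx, by) = polygon[i], polygon[(i + 1) % m]
        ex, ey = bx - ax, by - ay
        denom = dx * ey - dy * ex
        if abs(denom) < 1e-12:          # ray parallel to this edge
            continue
        # Solve p + t*(dx,dy) == a + s*(ex,ey) for ray parameter t, edge parameter s.
        t = ((ax - px) * ey - (ay - py) * ex) / denom
        s = ((ax - px) * dy - (ay - py) * dx) / denom
        if t >= 0.0 and 0.0 <= s <= 1.0:
            best = min(best, t)
    return best

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
d0 = depth(square, (0.5, 0.5), 0.0)     # hits the edge x = 1 at distance 0.5
```

The paper's data structure answers the inverse query: given a reading $d$, report all poses $(x, y, \theta)$ for which `depth` would return $d$.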